11/10/2020

Agenda

  • Regression trees
  • Classification trees
  • Bagging and Random Forests
  • Boosting trees
  • Variable Importance measures

Recap


  • Deep trees: flexible fitters (non-linearities, interactions), can overfit
  • Shallow trees: not so flexible, can underfit
  • Ensemble methods: we can improve the bias-variance tradeoff, at the expense of some loss of interpretability

Bagging

Recall Bootstrap

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. We introduce it here because it is particularly useful and frequently used in the context of decision trees.

  • Recall that given a set of \(n\) independent observations \(X_1, \dots, X_n\), each with variance \(\sigma^2\), the variance of the mean \(\overline{X}\) of the observations is given by \(\frac{\sigma^2}{n}\). In other words, averaging a set of observations reduces variance

  • Now imagine that these “observations” are the predictions given by a statistical method. Of course, this is not practical because we generally do not have access to multiple training sets to average over (a short simulation after this list illustrates how averaging reduces variance)
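
A quick simulation, purely illustrative and not part of the lecture code, shows the variance-reduction effect of averaging: with \(n = 25\) draws of standard deviation \(\sigma = 2\), the variance of the mean is close to \(\sigma^2 / n = 0.16\).

set.seed(1)
sigma <- 2
n <- 25
xbar <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
var(xbar)      # simulated variance of the mean, close to ...
sigma^2 / n    # ... the theoretical value 4 / 25 = 0.16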

Bagging

However, we can bootstrap by taking repeated samples from the (single) training data set.

To Bootstrap Aggregate (Bag), we proceed as follows (a code sketch follows the list):

  • Take \(B\) bootstrap samples from the training data, each of the same size as the training data. This will give us \(B\) training data sets
  • Fit a large tree to each bootstrap sample. This will give us \(B\) trees
  • Make a prediction for a new point \(x\), say \(\hat{f}^{(b)}(x)\), for each one of the \(B\) trees
  • Combine the results from each of the \(B\) trees to get an overall prediction, i.e. \[\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b = 1}^B \hat{f}^{(b)}(x)\]
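
A minimal hand-rolled sketch of this procedure, using rpart trees on the Boston housing data (the packages, object names, and tuning values here are illustrative assumptions, not the lecture's code):

library(rpart)
library(MASS)    # Boston data

set.seed(1)
B <- 200
n <- nrow(Boston)
trees <- vector("list", B)
for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)                 # bootstrap sample of size n
  trees[[b]] <- rpart(medv ~ ., data = Boston[idx, ],
                      control = rpart.control(cp = 0, minsplit = 5))   # a large tree
}

# Bagged prediction: average the B tree predictions for the points in `newdata`
predict_bag <- function(newdata) {
  preds <- sapply(trees, predict, newdata = newdata)  # one column per tree
  rowMeans(preds)
}
head(predict_bag(Boston))

Setting cp = 0 keeps each tree large (low bias, high variance), which is exactly what the averaging step is meant to tame.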

How to average?

  • For numeric \(y\) we can combine the results easily by making our overall prediction the average of the predictions from each of the \(B\) trees
  • For categorical \(y\), it is not quite so obvious how you want to combine the results from the different trees (both options are sketched in code after this list):
    • Majority vote: get a prediction for \(x\) from each tree, and the class that gets the most “votes” (out of \(B\) “ballots”) is the predicted class
    • Average the probabilities \(\hat{p}^{(b)}\) from each tree (what most software does)
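
For concreteness, a toy sketch with made-up numbers for \(B = 5\) trees and a single new point \(x\):

votes <- c("yes", "no", "yes", "yes", "no")   # class predicted by each of the B trees
names(which.max(table(votes)))                # majority vote: "yes"

probs <- c(0.9, 0.4, 0.7, 0.6, 0.3)           # P(y = "yes" | x) from each tree
mean(probs)                                   # averaged probability: 0.58
ifelse(mean(probs) > 0.5, "yes", "no")        # classify with the averaged probability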

Why does it work?

We “randomize” our data and then build a lot of big, and hence noisy, trees:

  • The relationships which are real get captured in a lot of trees and hence do not wash out when we average
  • Stuff that happens “by chance” is idiosyncratic to one (or a few) trees and washes out in the average
  • You need \(B\) big enough to get the averaging to work, but it does not seem to hurt if you make \(B\) bigger than that
  • The cost of having very large \(B\) is in computational time

Out-of-Bag Error Estimation

  • It turns out that there is a very straightforward way to estimate the test (out-of-sample) error of a bagged model
  • One can show that, on average, each bagged tree makes use of about \(2/3\) of the observations
  • The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations
  • We can predict the response for the \(i^{th}\) observation using each of the trees in which that observation was OOB. This will yield around \(B/3\) predictions for the \(i^{th}\) observation, which we average
  • This estimate is essentially the leave-one-out CV error for bagging, if \(B\) is large
  • By carefully keeping track of which bagged trees use which observations you can get out-of-sample predictions

Out-of-Bag Error Estimation

  1. For \(b = 1, \dots, B\):
    • Take a bootstrap sample \(\mathbf{x}^{(b)} = (x_1^{(b)}, \dots, x_n^{(b)})\) to use as a training set
    • Fit a regression tree \(\hat{f}^{(b)}\)
  2. For \(i = 1, \dots, n\):
    • Set \(S = \{b : (x_i, y_i) \notin \mathbf{x}^{(b)}\}\)
    • Calculate \[OOB_i = \left(y_i - \frac{1}{|S|} \sum_{b \in S} \hat{f}^{(b)}(x_i) \right)^2\]
  3. The final OOB MSE estimate is \(MSE_{test} = \frac{1}{n} \sum_{i=1}^n OOB_i\) (a hand-rolled version of this procedure is sketched below)
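
The following self-contained sketch mirrors these three steps, again with rpart trees on the Boston data (an illustrative setup, not the lecture's code); the key point is to remember each tree's bootstrap indices so we know for which trees observation \(i\) is OOB.

library(rpart)
library(MASS)

set.seed(1)
B <- 200
n <- nrow(Boston)
boot_idx <- vector("list", B)
trees    <- vector("list", B)
for (b in 1:B) {
  boot_idx[[b]] <- sample(n, n, replace = TRUE)       # remember who was sampled
  trees[[b]] <- rpart(medv ~ ., data = Boston[boot_idx[[b]], ],
                      control = rpart.control(cp = 0, minsplit = 5))
}

oob <- numeric(n)
for (i in 1:n) {
  S <- which(sapply(boot_idx, function(idx) !(i %in% idx)))    # trees for which i is OOB
  pred_i <- mean(sapply(S, function(b) predict(trees[[b]], newdata = Boston[i, ])))
  oob[i] <- (Boston$medv[i] - pred_i)^2
}
mean(oob)    # OOB estimate of the test MSE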

Boston data example

The plot suggests you just need a couple of hundred trees!


Random forests

  • Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees
  • As in bagging, we build a number of decision trees on bootstrapped training samples
  • But when building these decision trees, each time a split in a tree is considered, a random subset of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors. The split is allowed to use only one of those \(m\) predictors
  • Typical choice for \(m\) is \(m = \sqrt{p}\)

Note: Bagging is Random Forests with \(m = p\).
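
In R this is typically done with the randomForest package; the following is a sketch under the assumption that the package is installed, with illustrative parameter values:

library(randomForest)
library(MASS)

set.seed(1)
p <- ncol(Boston) - 1                     # 13 predictors
rf_fit  <- randomForest(medv ~ ., data = Boston, mtry = floor(sqrt(p)), ntree = 500)
bag_fit <- randomForest(medv ~ ., data = Boston, mtry = p, ntree = 500)   # m = p: bagging
rf_fit$mse[500]     # OOB MSE of the random forest after 500 trees
bag_fit$mse[500]    # OOB MSE of bagging, for comparison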


Boosting trees

Boosting trees

Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Boosting uses many trees, too. However, the trees are grown sequentially: each tree is grown using information from previously grown trees.

  • Fit the data with a single tree
  • Crush the fit so that it does not work very well
  • Look at the part of \(y\) not captured by the crushed tree and fit a new tree to what is “left over”
  • Crush the new tree. Your new fit is the sum of the two trees
  • Repeat the above steps iteratively. At each iteration you fit “what is left over” with a tree, crush the tree, and then add the new crushed tree into the fit
  • Your final fit is the sum of many trees

Boosting algorithm

  1. Set \(\hat{f}(x) = 0\) and \(r_i = y_i\) for all \(i\) in the training set
  2. For \(b = 1, 2, \dots, B\) repeat
    • Fit a tree \(\hat{f}^{b}\) with \(d\) splits (\(d + 1\) terminal nodes) to the training data \((X, r)\)
    • Update \(\hat{f}\) by adding in a shrunken version of the new tree \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^{b}(x)\)
    • Update the residuals \(r_i \leftarrow r_i - \lambda \hat{f}^{b}(x_i)\)
  3. Output the boosted model \[\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^{b}(x)\] (a hand-rolled version of this algorithm is sketched below)
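
A hand-rolled sketch of this algorithm with stumps (\(d = 1\)) on the Boston data, again using rpart; the values of \(\lambda\) and \(B\) below are illustrative, not tuned:

library(rpart)
library(MASS)

set.seed(1)
lambda <- 0.01
B <- 1000
d <- 1
fhat <- rep(0, nrow(Boston))          # step 1: f-hat = 0
r    <- Boston$medv                   #         r_i = y_i
trees <- vector("list", B)
for (b in 1:B) {
  dat <- data.frame(Boston[, names(Boston) != "medv"], r = r)
  trees[[b]] <- rpart(r ~ ., data = dat,
                      control = rpart.control(maxdepth = d, cp = 0))   # small tree (a stump when d = 1)
  step_b <- predict(trees[[b]])       # fitted values of the new tree
  fhat <- fhat + lambda * step_b      # add a shrunken version of the new tree
  r    <- r - lambda * step_b         # update the residuals
}
mean((Boston$medv - fhat)^2)          # training MSE of the boosted fit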

Why does it work?

  • Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly
  • Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals
  • Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter \(d\) in the algorithm
  • By fitting small trees to the residuals, we slowly improve \(\hat{f}\) in areas where it does not perform well. The shrinkage parameter \(\lambda\) slows the process down even further, allowing more and different shaped trees to attack the residuals

Choice of parameters

We have to choose:

  • \(\lambda\), shrinkage parameter: it makes each new tree a weak learner in the sense that it only does a little more fitting
  • \(B\), number of iterations (the number of trees in the sum). Unlike bagging and random forests, boosting can overfit if \(B\) is too large. We use cross-validation to select \(B\) (see the example after this list)
  • \(d\), the size of each new tree, which controls the complexity of the boosted ensemble. Often \(d = 1\) works well, in which case each tree is a stump, consisting of a single split and resulting in an additive model
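
In practice one usually relies on a package rather than the hand-rolled loop above. A sketch with the gbm package (assumed to be installed): shrinkage plays the role of \(\lambda\), n.trees of \(B\), interaction.depth of \(d\), and cv.folds lets cross-validation pick \(B\).

library(gbm)
library(MASS)

set.seed(1)
boost_fit <- gbm(medv ~ ., data = Boston, distribution = "gaussian",
                 n.trees = 5000, interaction.depth = 2, shrinkage = 0.01,
                 cv.folds = 5)
best_B <- gbm.perf(boost_fit, method = "cv")   # CV-selected number of trees
best_B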

Boosting for classification trees

Boosting for categorical \(y\) works in an analogous manner, but it is more complicated to define what is “left over” (a sketch using the gbm package follows the short list below).

  • We cannot simply use the residuals
  • We cannot just add up the fit
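
Packages handle these complications for us. A purely illustrative sketch with gbm, where the 0/1 outcome (whether medv exceeds its median) is made up for the example:

library(gbm)
library(MASS)

set.seed(1)
boston_cl <- Boston
boston_cl$high <- as.numeric(boston_cl$medv > median(boston_cl$medv))   # made-up 0/1 outcome
boston_cl$medv <- NULL

cl_fit <- gbm(high ~ ., data = boston_cl, distribution = "bernoulli",   # classification loss
              n.trees = 2000, interaction.depth = 1, shrinkage = 0.01)
head(predict(cl_fit, newdata = boston_cl, n.trees = 2000, type = "response"))   # predicted probabilities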


Boston data example

## [1] 571

Variable Importance measures

Variable importance measure

Ensemble methods can give dramatically better fits than simple trees.

  • Out-of-sample, they can work amazingly well
  • However, they are certainly not interpretable (you cannot look at hundreds or thousands of trees)

By computing summary measures, you can get some sense of how the trees work. In particular, we are often interested in which variables in \(x\) are really the “important” ones.

Variable importance measure

  • For a single tree, we can look at the splits (decision rules) and pick out the ones that use a particular variable. Then we can add up the reduction in RSS due to the splits using the variable. A large value indicates an important predictor
  • For bagging/random forests we can average the effect of a variable over the \(B\) trees
  • For Boosting we can sum the effects
  • Similarly, for classification trees, we consider the total amount by which the Gini index is decreased by splits over a given predictor (see the code sketch below)
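
In R these summaries are readily available. A hedged sketch, assuming the randomForest and gbm packages and fits along the lines of the earlier examples; the table on the next slide has the format of gbm's relative-influence output:

library(randomForest)
library(gbm)
library(MASS)

set.seed(1)
rf_fit <- randomForest(medv ~ ., data = Boston, importance = TRUE)
importance(rf_fit)      # %IncMSE and IncNodePurity for each predictor
varImpPlot(rf_fit)      # plot of the same measures

boost_fit <- gbm(medv ~ ., data = Boston, distribution = "gaussian",
                 n.trees = 5000, interaction.depth = 2, shrinkage = 0.01)
summary(boost_fit)      # relative influence (var, rel.inf) of each variable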

Variable importance measure

##             var    rel.inf
## lstat     lstat 32.9769020
## rm           rm 31.7265743
## dis         dis  9.6760567
## nox         nox  5.1820613
## crim       crim  4.9207831
## black     black  3.8421932
## ptratio ptratio  3.7491902
## age         age  3.1540710
## tax         tax  1.6917150
## chas       chas  1.0245005
## rad         rad  1.0180367
## indus     indus  0.9034953
## zn           zn  0.1344207

Recap

  • Decision trees are simple and interpretable models for regression and classification
  • However, they are often not competitive with other methods in terms of prediction accuracy
  • Bagging, random forests and boosting are good methods for improving the prediction accuracy of trees. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees
  • The latter two methods - random forests and boosting - are among the state-of-the-art methods for supervised learning
  • However, their results can be difficult to interpret

Question time